Exploration server for computer science research in Lorraine

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Recognition of printed arabic text based on global features and decision tree learning techniques

Internal identifier: 009D16 (Main/Exploration); previous: 009D15; next: 009D17

Authors: Adnan Amin [Australia]

Source: Pattern Recognition, vol. 33, no. 8 (2000), pp. 1309-1323

RBID : ISTEX:075C2186877896091C7ED5D01E963553ED179886

English descriptors

Abstract

Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic characters, either on-line or off-line. This is a result of the lack of adequate support in terms of funding and of utilities such as Arabic text databases and dictionaries, and of course of the cursive nature of Arabic writing; the problem remains an open research field. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantages of machine learning are twofold: it can generalize over the large degree of variation between different fonts and writing styles, and recognition rules can be constructed from examples. The technique can be divided into three major steps. The first step is digitization and pre-processing to create connected components, detect the skew of the document image and correct it. The second step is feature extraction, in which global features of the input Arabic word, such as the number of subwords, the number of peaks within each subword, and the number and position of complementary characters, are extracted so as to avoid the difficult segmentation stage. Finally, the C4.5 machine learning system is used to generate a decision tree for classifying each word. The system was tested with 1000 Arabic words in different fonts (each word has 15 samples), and the average correct recognition rate obtained using cross-validation was 92%.
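
The final step named in the abstract, building a decision tree with C4.5, chooses each split by gain ratio. The following is a minimal sketch of that criterion only, not the paper's implementation: the feature names (`subwords`, `peaks`, `dots`), the word labels, and the toy data are hypothetical, and a full C4.5 system also handles continuous attributes, missing values, and pruning.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """Gain ratio obtained by splitting on one discrete feature."""
    total = len(rows)
    # Group the class labels by the value the feature takes in each row.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalises features with many distinct values.
    split_info = -sum((len(g) / total) * math.log2(len(g) / total)
                      for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0


def best_feature(rows, labels):
    """Feature with the highest gain ratio; it would label the root node."""
    return max(rows[0].keys(), key=lambda f: gain_ratio(rows, labels, f))


# Hypothetical global features for six word images (two samples per word).
rows = [
    {"subwords": 1, "peaks": 2, "dots": 1},
    {"subwords": 1, "peaks": 3, "dots": 1},
    {"subwords": 2, "peaks": 2, "dots": 1},
    {"subwords": 2, "peaks": 3, "dots": 0},
    {"subwords": 3, "peaks": 2, "dots": 0},
    {"subwords": 3, "peaks": 3, "dots": 0},
]
labels = ["word_a", "word_a", "word_b", "word_b", "word_c", "word_c"]

if __name__ == "__main__":
    for name in rows[0]:
        print(name, round(gain_ratio(rows, labels, name), 3))
    print("root split:", best_feature(rows, labels))
```

In this toy set the number of subwords separates the three words perfectly, so it would be selected for the root; a tree builder would then recurse on each resulting subset.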

Url:
DOI: 10.1016/S0031-3203(99)00114-4


Affiliations:



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Recognition of printed arabic text based on global features and decision tree learning techniques</title>
<author>
<name sortKey="Amin, Adnan" sort="Amin, Adnan" uniqKey="Amin A" first="Adnan" last="Amin">Adnan Amin</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:075C2186877896091C7ED5D01E963553ED179886</idno>
<date when="2000" year="2000">2000</date>
<idno type="doi">10.1016/S0031-3203(99)00114-4</idno>
<idno type="url">https://api.istex.fr/ark:/67375/6H6-75HWFV2G-B/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000157</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">000157</idno>
<idno type="wicri:Area/Istex/Curation">000157</idno>
<idno type="wicri:Area/Istex/Checkpoint">001F68</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">001F68</idno>
<idno type="wicri:doubleKey">0031-3203:2000:Amin A:recognition:of:printed</idno>
<idno type="wicri:Area/Main/Merge">00A296</idno>
<idno type="wicri:Area/Main/Curation">009D16</idno>
<idno type="wicri:Area/Main/Exploration">009D16</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a">Recognition of printed arabic text based on global features and decision tree learning techniques</title>
<author>
<name sortKey="Amin, Adnan" sort="Amin, Adnan" uniqKey="Amin A" first="Adnan" last="Amin">Adnan Amin</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Australie</country>
<wicri:regionArea>School of Computer Science and Engineering, University of New South Wales, 2052 Sydney</wicri:regionArea>
<placeName>
<settlement type="city">Sydney</settlement>
<region type="état">Nouvelle-Galles du Sud</region>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Australie</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Pattern Recognition</title>
<title level="j" type="abbrev">PR</title>
<idno type="ISSN">0031-3203</idno>
<imprint>
<publisher>ELSEVIER</publisher>
<date type="published" when="2000">2000</date>
<biblScope unit="volume">33</biblScope>
<biblScope unit="issue">8</biblScope>
<biblScope unit="page" from="1309">1309</biblScope>
<biblScope unit="page" to="1323">1323</biblScope>
</imprint>
<idno type="ISSN">0031-3203</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0031-3203</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="Teeft" xml:lang="en">
<term>Adequate support</term>
<term>Algorithm</term>
<term>Amin</term>
<term>Amin pattern recognition</term>
<term>Arabic</term>
<term>Arabic character recognition</term>
<term>Arabic characters</term>
<term>Arabic letter</term>
<term>Arabic text</term>
<term>Arabic word</term>
<term>Arabic words</term>
<term>Automatic recognition</term>
<term>Baseline</term>
<term>Binary image</term>
<term>Black pixels</term>
<term>Block diagram</term>
<term>Character recognition</term>
<term>Comparaison dynamique</term>
<term>Complementary character</term>
<term>Complementary characters</term>
<term>Complete representation</term>
<term>Computer recognition</term>
<term>Computer science</term>
<term>Cursive</term>
<term>Cursive nature</term>
<term>Decision tree</term>
<term>Document analysis</term>
<term>Elsevier science</term>
<term>Error rate</term>
<term>Error rates performance</term>
<term>Feature extraction</term>
<term>First kuwait computer conference</term>
<term>Global features</term>
<term>Handprinted characters</term>
<term>Handwriting recognition</term>
<term>Handwritten</term>
<term>Handwritten characters</term>
<term>Horizontal projection</term>
<term>Ieee</term>
<term>Ieee trans</term>
<term>Inner contours</term>
<term>Intensive research</term>
<term>International conference</term>
<term>Japanese characters</term>
<term>Large number</term>
<term>Leaf node</term>
<term>Leaf nodes</term>
<term>Line segment</term>
<term>Machine recognition</term>
<term>Markov models</term>
<term>National computer conference</term>
<term>Node</term>
<term>Other utilities</term>
<term>Pattern recognition</term>
<term>Pattern recognition society</term>
<term>Pixel</term>
<term>Previous scanline</term>
<term>Recognition process</term>
<term>Recognition rate</term>
<term>Same font</term>
<term>Same shape</term>
<term>Segmentation</term>
<term>Segmentation stage</term>
<term>Subwords</term>
<term>Successive scanlines</term>
<term>Symbolic machine</term>
<term>Syntactical pattern recognition</term>
<term>Technical papers</term>
<term>Text recognition</term>
<term>Uniform population</term>
<term>Vowel diacritics</term>
<term>White pixels</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic characters, either on-line or off-line. This is a result of the lack of adequate support in terms of funding and of utilities such as Arabic text databases and dictionaries, and of course of the cursive nature of Arabic writing; the problem remains an open research field. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantages of machine learning are twofold: it can generalize over the large degree of variation between different fonts and writing styles, and recognition rules can be constructed from examples. The technique can be divided into three major steps. The first step is digitization and pre-processing to create connected components, detect the skew of the document image and correct it. The second step is feature extraction, in which global features of the input Arabic word, such as the number of subwords, the number of peaks within each subword, and the number and position of complementary characters, are extracted so as to avoid the difficult segmentation stage. Finally, the C4.5 machine learning system is used to generate a decision tree for classifying each word. The system was tested with 1000 Arabic words in different fonts (each word has 15 samples), and the average correct recognition rate obtained using cross-validation was 92%.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Australie</li>
</country>
<region>
<li>Nouvelle-Galles du Sud</li>
</region>
<settlement>
<li>Sydney</li>
</settlement>
</list>
<tree>
<country name="Australie">
<region name="Nouvelle-Galles du Sud">
<name sortKey="Amin, Adnan" sort="Amin, Adnan" uniqKey="Amin A" first="Adnan" last="Amin">Adnan Amin</name>
</region>
<name sortKey="Amin, Adnan" sort="Amin, Adnan" uniqKey="Amin A" first="Adnan" last="Amin">Adnan Amin</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 009D16 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 009D16 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:075C2186877896091C7ED5D01E963553ED179886
   |texte=   Recognition of printed arabic text based on global features and decision tree learning techniques
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022